A Supplementary Materials
A.1 Dataset Description We describe additional details of each dataset below. We use the first 90% of the data for training and the last 10% for validation, and we leave the last 210 days as the test set. We further evaluate the sensitivity of different hyper-parameters. The distribution of attention densities for different α is shown in Figure 2.
- Asia > Middle East > Israel (0.04)
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Information Technology > Data Science (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Asia > Middle East > Israel (0.04)
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Data Science (0.93)
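The chronological split described above can be sketched as follows. This is a minimal illustration, assuming daily records sorted by date and assuming the 90/10 train/validation split applies to the portion remaining after the 210-day test set is held out (the fragment does not state this explicitly); `split_dataset` is a hypothetical helper, not from the paper.

```python
def split_dataset(records, test_days=210):
    # Hold out the last 210 days as the test set (one record per day assumed).
    test = records[-test_days:]
    rest = records[:-test_days]
    # First 90% of the remainder for training, last 10% for validation.
    cut = int(len(rest) * 0.9)
    return rest[:cut], rest[cut:], test

# Illustrative usage on 1000 synthetic daily records.
train, val, test = split_dataset(list(range(1000)))
```

With 1000 daily records this leaves 790 pre-test days, split 711/79 between training and validation.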
Hidden Dynamics of Massive Activations in Transformer Training
Gallego-Feliciano, Jorge, McClendon, S. Aaron, Morinelli, Juan, Zervoudakis, Stavros, Saravanos, Antonios
Massive activations are scalar values in transformer hidden states that achieve values orders of magnitude larger than typical activations and have been shown to be critical for model functionality. While prior work has characterized these phenomena in fully trained models, the temporal dynamics of their emergence during training remain poorly understood. We present the first comprehensive analysis of massive activation development throughout transformer training, using the Pythia model family as our testbed. Through systematic analysis of various model sizes across multiple training checkpoints, we demonstrate that massive activation emergence follows predictable mathematical patterns that can be accurately modeled using an exponentially-modulated logarithmic function with five key parameters. We develop a machine learning framework to predict these mathematical parameters from architectural specifications alone, achieving high accuracy for steady-state behavior and moderate accuracy for emergence timing and magnitude. These findings enable architects to predict and potentially control key aspects of massive activation emergence through design choices, with significant implications for model stability, training cycle length, interpretability, and optimization. Our findings demonstrate that the emergence of massive activations is governed by model design and can be anticipated, and potentially controlled, before training begins.
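The exponentially-modulated logarithmic function mentioned in the abstract can be sketched as below. The specific parameterization here is an assumption for illustration (the abstract states only that the form has five parameters): amplitude `A`, exponential decay rate `lam`, logarithmic growth rate `k`, horizontal shift `s`, and offset `c` are hypothetical names, and the parameter values are synthetic.

```python
import math

def massive_activation_curve(t, A, lam, k, s, c):
    # Hypothetical exponentially-modulated logarithmic form with five
    # parameters: an exponential envelope exp(-lam * t) modulating a
    # logarithmic growth term log(k * t + s), plus a constant offset c.
    return A * math.exp(-lam * t) * math.log(k * t + s) + c

# Evaluate at a few training-step values with illustrative parameters.
steps = [1, 10, 100, 1000, 10000]
values = [massive_activation_curve(t, A=500.0, lam=1e-4, k=0.5, s=1.0, c=20.0)
          for t in steps]
```

In practice such a curve would be fit per model to measured peak activation magnitudes across checkpoints, and the fitted parameters would then serve as regression targets for the architecture-based predictor the abstract describes.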
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost
Cho, Sungjun, Min, Seonwoo, Kim, Jinwoo, Lee, Moontae, Lee, Honglak, Hong, Seunghoon
To overcome the quadratic cost of self-attention, recent works have proposed various sparse attention modules, most of which fall under one of two groups: 1) sparse attention under hand-crafted patterns and 2) full attention followed by a sparse variant of softmax such as $\alpha$-entmax. Unfortunately, the first group lacks adaptability to data while the second still requires quadratic cost in training. In this work, we propose SBM-Transformer, a model that resolves both problems by endowing each attention head with a mixed-membership Stochastic Block Model (SBM). Each attention head then data-adaptively samples a bipartite graph, whose adjacency matrix is used as an attention mask for each input. During backpropagation, a straight-through estimator is used to flow gradients beyond the discrete sampling step and adjust the probabilities of sampled edges based on the predictive loss. The forward and backward costs are thus linear in the number of edges, which each attention head can also choose flexibly based on the input. By assessing the distribution of graphs, we theoretically show that SBM-Transformer is a universal approximator for arbitrary sequence-to-sequence functions in expectation. Empirical evaluations under the LRA and GLUE benchmarks demonstrate that our model outperforms previous efficient variants as well as the original Transformer with full attention. Our implementation can be found at https://github.com/sc782/SBM-Transformer .
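The mask-sampling step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the sizes, the Dirichlet-distributed membership vectors, and the affinity matrix `B` are assumptions made for the sketch, and the straight-through gradient step is only noted in a comment.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_blocks = 6, 3  # sequence length and number of SBM blocks (illustrative)

# Mixed-membership vectors for queries and keys (each row sums to 1),
# and a block-to-block affinity matrix B with entries in [0, 1].
Z_q = rng.dirichlet(np.ones(num_blocks), size=n)
Z_k = rng.dirichlet(np.ones(num_blocks), size=n)
B = rng.uniform(0.0, 1.0, size=(num_blocks, num_blocks))

# Edge probabilities of the bipartite graph: P[i, j] = z_q_i^T B z_k_j.
# Because memberships are convex weights and B is in [0, 1], so is P.
P = Z_q @ B @ Z_k.T

# Sample a binary attention mask; in the real model a straight-through
# estimator lets gradients flow back to P despite the discrete sampling.
mask = (rng.uniform(size=P.shape) < P).astype(np.float32)
```

Attention would then be computed only over the sampled edges, making the cost linear in the number of edges rather than quadratic in sequence length.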
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- (4 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Kernel Deformed Exponential Families for Sparse Continuous Attention
Moreno, Alexander, Nagesh, Supriya, Wu, Zhenke, Dempsey, Walter, Rehg, James M.
Attention mechanisms take an expectation of a data representation with respect to probability weights. This creates summary statistics that focus on important features. Recently, Martins et al. (2020; 2021) proposed continuous attention mechanisms, focusing on unimodal attention densities from the exponential and deformed exponential families: the latter has sparse support. Farinhas et al. (2021) extended this to use Gaussian mixture attention densities, which are a flexible class with dense support. In this paper, we extend this to two general flexible classes: kernel exponential families (Canu & Smola, 2006) and our new sparse counterpart, kernel deformed exponential families. Theoretically, we show new existence results for both kernel exponential and deformed exponential families, and that the deformed case has similar approximation capabilities to kernel exponential families. Experiments show that kernel deformed exponential families can attend to multiple compact regions of the data domain. Attention mechanisms take weighted averages of data representations (Bahdanau et al., 2015), where the weights are a function of input objects. These are then used as inputs for prediction. Discrete attention 1) cannot easily handle data where observations are irregularly spaced, and 2) may produce scattered attention maps that lack focus.
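The continuous-attention expectation described above can be sketched numerically. This is a toy illustration, not the paper's kernel construction: the density here is a hypothetical sparse (compactly supported) density, standing in for a deformed-exponential-family density that is zero outside a compact region, and the value map is an arbitrary synthetic choice.

```python
import math

def sparse_density(t, lo=0.2, hi=0.6):
    # Hypothetical attention density with compact support [lo, hi]:
    # uniform on that interval, exactly zero elsewhere (sparse support).
    return 1.0 / (hi - lo) if lo <= t <= hi else 0.0

def value(t):
    # Illustrative data representation over the domain [0, 1].
    return math.sin(math.pi * t)

# Midpoint Riemann-sum approximation of the context E_p[v(t)] over [0, 1].
N = 10000
ts = [(i + 0.5) / N for i in range(N)]
context = sum(sparse_density(t) * value(t) for t in ts) / N
```

The density's sparse support means the summary statistic draws only from the region [0.2, 0.6]; with these choices the expectation works out to about 0.89 in closed form, which the Riemann sum recovers.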
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)